****************************************************
Title 37, Code of Federal Regulations, §1.9(c)-(f)

(c) An independent inventor as used in this chapter means any inventor who (1) has not 
assigned, granted, conveyed, or licensed, and (2) is under no obligation under contract or law to assign, 
grant, convey, or license, any rights in the invention to any person who could not likewise be classified 
as an independent inventor if that person had made the invention, or to any concern which would not 
qualify as a small business concern or a nonprofit organization under this section.

(d) A small business concern as used in this chapter means any business concern meeting the 
size standards set forth in 13 C.F.R. Part 121 to be eligible for reduced patent fees. Questions related to 
size standards for a small business concern may be directed to: Small Business Administration, Size 
Standards Staff, 409 Third Street, SW, Washington, DC 20416.

(e) A nonprofit organization as used in this chapter means (1) a university or other institution
of higher education located in any country; (2) an organization of the type described in section 501(c)(3)
of the Internal Revenue Code of 1954 (26 U.S.C. 501(c)(3)) and exempt from taxation under section 501(a)
of the Internal Revenue Code (26 U.S.C. 501(a)); (3) any nonprofit scientific or educational organization 
qualified under a nonprofit organization statute of a state of this country (35 U.S.C. 201(i)); or (4) any 
nonprofit organization located in a foreign country which would qualify as a nonprofit organization under 
paragraphs (e) (2) or (3) of this section if it were located in this country. 

(f) A small entity as used in this chapter means an independent inventor, a small business 
concern or a nonprofit organization eligible for reduced patent fees. 

**************************************************** 
Title 13, Code of Federal Regulations, §121.12

121.12 Small business for paying reduced patent fees. (a) Pursuant to Pub. L. 97-247, a small business
concern for purposes of paying reduced fees under 35 U.S. Code 41(a) and (b) to the Patent and
Trademark Office means any business concern (1) whose number of employees, including those of its
affiliates, does not exceed 500 persons and (2) which has not assigned, granted, conveyed, or licensed, and
is under no obligation under contract or law to assign, grant, convey or license, any rights in the invention
to any person who could not be classified as an independent inventor if that person had made the invention,
or to any concern which would not qualify as a small business concern or a nonprofit organization under
this section. For the purpose of this section concerns are affiliates of each other when either, directly or
indirectly one concern controls or has the power to control the other, or a third party or parties controls 
or has the power to control both. The number of employees of the business concern is the average over 
the fiscal year of the persons employed during each of the pay periods of the fiscal year. Employees are 
those persons employed on a full-time, part-time or temporary basis during the previous fiscal year of the 
concern. 

DISTRIBUTED CONTENT IDENTIFICATION SYSTEM

INVENTORS: 

Mark Raymond Pace 
Brooks Cash Talley 

BACKGROUND OF THE INVENTION 
Field of the Invention 

The invention relates to the field of content identification for files on 
a network. 


Description of the Related Art 

With the proliferation and growth of the Internet, content transfer 
between systems on both public and private networks has increased 
exponentially. While the Internet has brought a good deal of information
to a large number of people in a relatively inexpensive manner, this
proliferation has certain downsides. One such downside, associated with
the growth of e-mail in particular, is generally referred to as "spam" e-mail.
Spam e-mail is unsolicited e-mail which is usually sent out in large
volumes over a short period of time with the intent of inducing the recipient
into availing themselves of sales opportunities or "get rich quick" schemes.

To rid themselves of spam, users may resort to a number of
techniques. The most common is simple filtering using the e-mail filters
built into e-mail client programs. In this type of filtering, the user
will set up filters based on specific words, subject lines, source addresses,
senders or other variables, and the e-mail client will process the incoming
e-mail when it is received, or at the server level, and take some action
depending upon the manner in which the filter is defined.

More elaborate e-mail filtering services have been established
where, for a nominal fee, off-site filtering will be performed at a remote site.
In one system, e-mails are forwarded offsite to a service provider and the
automatic filtering occurs at the provider's location based on heuristics
which are updated by the service provider. In other systems, offsite
filtering occurs using actual people to read through e-mails and judge
whether e-mail is spam or not. Other systems are hybrids, where
heuristics are used and, periodically, real people review e-mails which are
forwarded to the service to determine whether the e-mail constitutes
"spam" within the aforementioned definition. In these hybrid services,
personal reviews occur on a random basis and hence constitute only a
spot check of the entire volume of e-mail which is received by the service.
In systems where real people review e-mails, confidentiality issues arise
since e-mails are reviewed by a third party who may or may not be under
an obligation of confidentiality to the sender or recipient of the e-mail.

In addition, forwarding the entire e-mail including attachments to an
outside service raises a bandwidth issue, since it effectively triples the
bandwidth consumed by a particular e-mail: once for the
initial transmission, a second time for the transmission to the service, and
a third time from the service back to the server for redistribution to the
ultimate recipient.

Further, senders of spam have become much more sophisticated at
avoiding the aforementioned filters. The use of dynamic addressing
schemes, very long subject lines and anonymous re-routing
services makes it increasingly difficult for normal filtering schemes, and
even the heuristics-based services discussed above, to remain constantly
up-to-date with respect to the spammers' ever-changing methods.

Another downside to the proliferation of the Internet is that it is a
very efficient mechanism for delivering computer viruses to a great number
of people. Virus identification is generally limited to programs which run
and reside on the individual computer or server in a particular enterprise
and which regularly scan files and e-mail attachments for known viruses
using a number of techniques.

SUMMARY OF THE INVENTION

Hence, the object of the invention is to provide a content 
classification system which identifies content in an efficient, up-to-date 
manner. 

A further object of the invention is to leverage the content
received by other users of the classification system to determine the
characteristic of the content.

Another object of the invention is to provide a service which quickly
and efficiently identifies a characteristic of the content of a given
transmission on a network at the request of the recipient.

Another object of the invention is to provide the above objects in a
confidential manner.

A still further object is to provide a system which operates with low 
bandwidth. 

These and other objects of the invention are provided in the present
invention. The invention, roughly described, comprises a file content
classification system. In one aspect the system includes a digital ID
generator and an ID appearance database coupled to receive IDs from the ID
generator. The system further includes a characteristic comparison routine
identifying the file as having a characteristic based on ID appearance in
the appearance database.

In a particular embodiment, the file is an e-mail file and the system
utilizes a hashing process to produce digital IDs. The IDs are forwarded
to a processor via a network. The processor performs the characterization
and determination steps. The processor then replies to the generator to
enable further processing of the e-mail based on the characterization reply.

In a further aspect, the invention comprises a method for identifying
a characteristic of a data file. The method comprises the steps of:
generating a digital identifier for the data file and forwarding the identifier
to a processing system; determining whether the forwarded identifier
matches a characteristic of other identifiers; and processing the e-mail
based on said step of determination.

In yet another aspect, the invention comprises a method for
providing a service on the Internet, comprising: collecting data from a
plurality of systems having a client agent on the Internet to a server having
a database; characterizing the data received relative to information
collected in the database; and transmitting a content identifier to the client
agent. In this aspect, said step of collecting comprises collecting a digital
identifier for a data file. In addition, said step of characterizing comprises:
tracking the frequency of the collection of a particular identifier;
characterizing the data file based on said frequency; storing the
characterization; and comparing collected identifiers to the known
characterization.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be described with respect to the particular 
embodiments thereof. Other objects, features, and advantages of the 
invention will become apparent with reference to the specification and 
drawings in which: 

Figure 1 is a block diagram illustrating a system for filtering e-mail
to identify content in accordance with the prior art.

Figure 2 is a block diagram illustrating the process of the present 
invention. 

Figure 3 is a block diagram illustrating in additional detail the
method and apparatus of the present invention.

Figure 4 is a block diagram illustrating a second embodiment of the
method and apparatus of the present invention.

DETAILED DESCRIPTION

The present invention provides a distributed content classification
system which utilizes a digital identifier for each piece of content which is
sought to be classified, and characterizes the content based on this ID. In
one aspect of the system, the digital identifier is forwarded to a processing
system which correlates any number of other identifiers through a
processing algorithm to determine whether a particular characteristic for
the content exists. In essence, the classification is a true/false test for the
content based on the query for which the classification is sought. For
example, a system can identify whether a piece of e-mail is or is not spam,
or whether the content in a particular file matches given criteria
indicating it is or is not copyrighted material or contains or does not contain
a virus.

While the present invention will be discussed with respect to
classifying e-mail messages, it will be understood by those of average skill
in the art that the data classification system of the present invention can
be utilized to classify any sort of text or binary data which resides on or is
transmitted through a system.

Figure 1 is a high level depiction of the present invention wherein
an e-mail sender 10 transmits an e-mail which is intercepted by a filtering
process/system 15 before being forwarded to the recipient. The system has
the ability to act on the e-mail before the recipient 20 ever sees the
message.

Figure 2 illustrates the general process of the present invention in
the e-mail context. When an e-mail sender 10 transfers an e-mail to its
intended recipient 40, the message arrives at a first tier system 20 which
in this example may represent an e-mail server. Normally (in the absence
of the system of the present invention), the first tier system 20 will transmit
an e-mail directly to the intended recipient when the recipient's e-mail
client application requests transmission of the e-mail. In the present
invention, a digital identifier engine on the first tier system cooperating with
the e-mail server will generate a digital identifier which comprises, in one
environment, a hash of at least a portion of the e-mail. The digital
identifier is then forwarded to a second tier system 30. Second tier system
30 includes a database and processor which determines, based on an
algorithm which varies with the characteristic tested, whether the e-mail
meets the classification of the query (e.g. is it spam or not?).

Based on the outcome of this algorithm, a reply is sent from the
second tier system 30 to the first tier system 20, where the system then
processes the e-mail in accordance with the regenerated description by
the user based on the outcome of the filter. The result can be as shown
in Figure 2, the filtered e-mail product being forwarded to the e-mail
recipient. Other options for disposition of the e-mail depending upon the
outcome of the algorithm computed at second tier system 30 are described
below.

It should be understood with reference to Figures 1 and 2 that the
external e-mail sender can be any source of electronic mail or electronic
data sent to the filtering process from sources outside the system. The
e-mail recipients 40 represent the final destination of electronic data that
passes through the filtering process.

In one aspect, the system may be implemented in executable code
which runs on first tier system 20 and generates digital IDs in accordance
with the MD5 hash fully described at
http://www.w3.org/TR/1998/Rec-DSig-label/MD5-1_0. It should be recognized,
however, that any hashing algorithm can be utilized. In one embodiment,
the digital ID generated by
the MD5 hash is of the entire subject line up to the point where two spaces
appear, the entire body, and the last 500 bytes of the body of the
message. It should be further understood that the digital ID generated
may be one hash, or multiple hashes, and the hashing algorithm may be
performed on all or some portion of the data under consideration. For
example, the hash may be of the subject line, some number of characters
of the subject line, all of the body or portions of the body of the message.
It should further be recognized that the digital ID is not required to be of
fixed length.
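
By way of illustration only, the ID generation of this embodiment might be sketched in Python as follows. The function names, the UTF-8 encoding, and the choice to feed all three portions into a single MD5 hash (rather than producing multiple hashes) are assumptions made for this sketch and are not taken from the specification.

```python
import hashlib


def subject_prefix(subject: str) -> str:
    """Return the subject line up to the point where two spaces appear."""
    return subject.split("  ", 1)[0]


def digital_id(subject: str, body: str, tail_bytes: int = 500) -> str:
    """Hash the subject prefix, the entire body, and the last bytes of the body.

    Assumption: the three portions are concatenated into one MD5 hash.
    """
    body_raw = body.encode("utf-8")
    md5 = hashlib.md5()
    md5.update(subject_prefix(subject).encode("utf-8"))
    md5.update(body_raw)
    # Slicing returns the whole body when it is shorter than tail_bytes.
    md5.update(body_raw[-tail_bytes:])
    return md5.hexdigest()
```

Under this sketch, two messages whose subject lines differ only after a double space, and whose bodies are identical, yield the same digital ID, so random suffixes appended to a subject line do not change the ID.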

The first tier executable may be run as a separate process or as a
plug-in with the e-mail system running on a first tier system 20. In one
embodiment, the executable interfaces with a mail server commonly used
on a system such as first tier system 20, known as Sendmail™.
A common set of tools utilized with Sendmail™ is Procmail
(http://www.ii.com/internet/robots/procmail). In one aspect of the system
of the present invention, the executable may interface with Sendmail™ and
Procmail. In such an embodiment, a configuration file (such as a
sendmail.cf) includes a line of code which instructs the Procmail server
program to process incoming e-mails through the first tier site e-mail
executable to generate and transport digital IDs to the second tier system,
receive its reply, and instruct Procmail to process or delete the
message as a result of the reply message.

It should be understood that the executable may be written in, for 
example, perl script and can be designed to interact with any number of 
commercial or free e-mail systems, or other data transfer systems in 
applications other than e-mail. 

The use of digital IDs in this context reduces the bandwidth
required to transport data across the network to the second tier system.
Typically, the ID will not only contain the hashed data, but may include
versioning information which informs the second tier system 30 of the type
of executable running on the first tier system 20.

In addition, the reply of the second tier system to the first tier
system may be, for example, a refusal of service from the second tier
system 30 to the first tier system 20 in cases where the first tier system is
not authorized to make such requests. It will be recognized that revenue
may be generated in accordance with the present invention by providing
the filtering service (i.e. running the second tier service process and
maintaining the second tier database) for a fee based on volume or other
revenue criteria. In this commercial context, the reply may be a refusal of
service to a user of the first tier system 20 who has exceeded their
allotted filtering quota for a given period.
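
The two refusal cases described above can be sketched as a simple check performed before an ID is processed. This is a minimal illustration only; the `ClientAccount` record, its fields, and the reply strings are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass
class ClientAccount:
    """Hypothetical per-client record kept by the second tier system."""
    authorized: bool
    quota: int      # requests allowed in the current period (assumed unit)
    used: int = 0   # requests already served in the current period


def handle_request(account: ClientAccount) -> str:
    """Return a reply code: refuse unauthorized or over-quota clients."""
    if not account.authorized:
        return "REFUSED_UNAUTHORIZED"
    if account.used >= account.quota:
        return "REFUSED_QUOTA_EXCEEDED"
    account.used += 1
    return "ACCEPTED"
```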

Figure 3 shows a second embodiment of the system of the present
invention. In Figure 3, the first tier system is broken down into three
components including a message preprocessing section 110, a message
processing section 120, and a configuration file DS10. In this example, the
e-mail from sender 10 is first diverted to message preprocessing 110.
The preprocessing algorithm is configured with rules from configuration file
DS10. These rules are guidelines on how and when, for example, to
generate digital IDs from the e-mail which is received. Message
preprocessing receives the e-mail from the e-mail sender 10 and generates
digital IDs based on the preprocessing rules from DS10. DS10 is a
configuration file which stores configuration rules (for both preprocessing
and postprocessing) for the first tier system 20. The message processing
rules may include guidelines on how to dispose of those e-mails classified
as spam. For example, a message may be deleted, may be
forwarded to a holding area for electronic mail that has been deemed to be
spam by second tier system 30, have the word "SPAM" added to the
subject line, moved to a separate folder, and the like. In this example,
message preprocessing rules include rules which might exempt all e-mails
from a particular destination or address from filtering by the system. If a
message meets such exemption criteria, the message is automatically
forwarded, as shown on line 50, directly to message processing 120 for
forwarding directly onto the e-mail recipient 40. Such rules may also
comprise criteria for forwarding an e-mail directly to a rejected message
depository DS20.

If a preprocessing rule does not indicate a direct passage of a
particular e-mail through the system, one or more digital identifiers will be
generated as shown at line 66 and transmitted to the second tier system
30'. In the example shown at Figure 3, second tier system 30' includes a
second tier server 210 and a third tier database 220. In this example, the
second tier server relays digital IDs and replies between preprocessing
and the message processing 120. The example shown in Figure 3 is
particularly useful in an Internet based environment where the second tier
server 210 may comprise a web server which is accessible through the
Internet and the third tier database 220 is shielded from the Internet by the
second tier server through a series of firewalls or other security measures.
This ensures that the database of digital ID information which is compiled
at the third tier database 220 is free from attack from individuals desirous
of compromising the security of this system.

In this case, second tier server 210 forwards the digital ID directly
to the third tier database which processes the IDs based on the algorithm
for testing the data in question. The third tier database generates a reply
which is forwarded by the second tier server back to message processing
120. Message processor 120 can then act on the e-mail by either sending
filtered e-mail to the e-mail recipient, sending the filtered e-mail to the
rejected message depository DS20 or acting on the message in
accordance with user-chosen configuration settings specified in
configuration file DS10.

In the environment shown in Figure 3, the configuration file DS10
on the first tier allows other decisions about the e-mail received from the
e-mail sender 10 to be made, based on the reply from second tier 30'. For
example, in addition to deleting spam e-mail, the subject line may be
appended to indicate that the e-mail is "spam," the e-mail may be held in
a quarantine zone for some period of time, an auto reply generated, and
the like. In addition, the message preprocessing and message processing
rules allow decisions on e-mail processing to account for situations where
second tier system 30' is inaccessible. Decisions which may be
implemented in such cases may include "forward all e-mails," "forward no
e-mails," "hold for further processing," and the like.
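
As a non-limiting sketch, the disposition logic described above, including the fallback when the second tier system cannot be reached, might look as follows in Python. The reply values, the action names, and the representation of an unreachable second tier as `None` are all assumptions made for this illustration.

```python
from typing import Optional, Tuple


def dispose(subject: str,
            reply: Optional[str],
            on_spam: str = "tag",
            on_unreachable: str = "forward") -> Tuple[str, str]:
    """Return (action, subject) for one message.

    reply: "spam", "not_spam", or None when the second tier is inaccessible.
    on_spam: "delete", "tag", or "quarantine" (hypothetical rule names).
    on_unreachable: "forward", "reject", or "hold" (hypothetical rule names).
    """
    if reply is None:
        # Second tier inaccessible: apply the configured fallback decision.
        return on_unreachable, subject
    if reply == "spam":
        if on_spam == "tag":
            # Mark the subject line and still deliver the message.
            return "forward", subject + " [SPAM]"
        return on_spam, subject
    return "forward", subject
```

Keeping the fallback decision in configuration, rather than in code, mirrors the role the text assigns to the configuration file on the first tier.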

In an Internet based environment, the second tier server 210 may
transmit a digital identification and other information to the third tier
database 220 by means of the HTTP protocol. It should be recognized
that other protocols may be used in accordance with the present invention.
The third tier database 220 may be maintained on any number of different
commercial database platforms. In addition, the third tier database may
include system management information, such as client identifier tracking
and revenue processing information. In a unique aspect of the present
invention in general, the digital IDs in third tier database 220 are maintained
on a global basis. That is, all first tier servers which send digital IDs to
second tier servers 210 contribute data to the database and the
processing algorithm running on the third tier system. In one embodiment,
where spam determination is the goal, the algorithm computes, for
example, the frequency with which a message (or, in actuality, the ID for
the message) is received within a particular time frame. For example, if
a particular ID indicating the same message is seen some number of times
per hour, the system classifies the message (and ID) as spam. All
subsequent IDs matching the ID classified as spam will now cause the
system 30' to generate a reply that the e-mail is spam. Each client having
a first tier system 20' which participates in the system of the present
invention benefits from the data generated by other clients. Thus, for
example, if a particular client receives a number of spam e-mails meeting
the frequency requirement, causing the system to classify the message as
spam, another client having a first tier system 20' which then sees a similar
message will automatically receive a reply that the message is spam.
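
The frequency test described above can be illustrated with a sliding one-hour window. The sketch below is one possible reading only: the class name, the threshold of 100 sightings, the window length, and the in-memory data structures are assumptions, since the specification leaves the actual values and storage open.

```python
from collections import defaultdict, deque


class FrequencyClassifier:
    """Classify an ID as spam once it is seen often enough within a window."""

    def __init__(self, threshold: int = 100, window_seconds: float = 3600.0):
        self.threshold = threshold          # assumed sightings per window
        self.window = window_seconds        # assumed window length
        self.sightings = defaultdict(deque)  # ID -> recent report times
        self.spam_ids = set()               # IDs already classified as spam

    def report(self, digital_id: str, now: float) -> bool:
        """Record one sighting from any client; return True if the ID is spam."""
        if digital_id in self.spam_ids:
            return True
        seen = self.sightings[digital_id]
        seen.append(now)
        # Discard sightings that have fallen out of the time window.
        while seen and now - seen[0] > self.window:
            seen.popleft()
        if len(seen) >= self.threshold:
            self.spam_ids.add(digital_id)
            del self.sightings[digital_id]
            return True
        return False
```

Because all clients report into the same classifier, an ID pushed over the threshold by one client's traffic is immediately reported as spam to every other client that later submits it.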

It should be recognized that in certain cases, large reputable
companies forward a large block of e-mails to a widespread number of
users, such as, for example, information mailing list servers specifically
requested by e-mail receivers. The system accounts for such mailing list
application on both the area and second tier system levels. Exceptions
may be made in the algorithm running on the third tier database 220 to
take into account the fact that reputable servers should be allowed to send
a large number of e-mails to a large number of recipients at the destination
system 20'. Alternatively, or in conjunction with such exceptions, users
may define their own exceptions via the DS10 configuration. As a service,
any number of acceptable sources such as, for example, the Fortune 1000
companies' domain names may be characterized as exempted "no spam"
sites, and users can choose to "trust" or "not trust" server side settings.
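
One way to picture the interaction between user-defined exceptions and server side "no spam" settings is the following sketch. The function name, the use of sender domains as the exemption key, and the parameter names are hypothetical; the specification does not state how exemptions are stored or matched.

```python
def is_exempt(sender: str,
              user_exempt: set,
              server_exempt: set,
              trust_server: bool) -> bool:
    """Decide whether a sender bypasses spam classification.

    user_exempt: domains the user exempted locally (assumed to be lowercase).
    server_exempt: domains exempted on the server side (assumed lowercase).
    trust_server: whether the user chose to trust the server side settings.
    """
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in user_exempt:
        return True
    return trust_server and domain in server_exempt
```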

While the aforementioned embodiment utilizes a frequency 
algorithm to determine whether a message is spam, additional 
embodiments in the algorithm can analyze messages for the frequency of 
particular letters or words, and/or the relationship of the most common 
words to the second most common words in a particular message. Any 
number of variants of the algorithm may be used. 
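
For illustration, the relationship of the most common word to the second most common word in a message could be computed as a simple ratio. The tokenization rule and the function name below are assumptions, and the specification does not say how such a ratio would be thresholded or combined with the frequency test.

```python
import re
from collections import Counter


def top_word_ratio(text: str) -> float:
    """Ratio of the most common word count to the second most common.

    Returns 0.0 when the text has fewer than two distinct words.
    """
    words = re.findall(r"[a-z']+", text.lower())
    top_two = Counter(words).most_common(2)
    if len(top_two) < 2:
        return 0.0
    return top_two[0][1] / top_two[1][1]
```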

It should be further recognized that the second tier server can be
utilized to interface with value added services, such as connecting the
users to additional mailing lists and reference sources, providing feedback
on the recipients' characteristics to others, and the like.

Figure 4 shows a further embodiment of the invention and details
how the server side system manipulates the digital identifiers. In
Figure 4, the embodiment includes a DS10 configuration file which
provides message exemption criteria and message processing rules to
message preprocessing routine 110 and message processing 120,
respectively. Message preprocessing 110 may be considered as two
components: message exemption checking 111 and digital ID creation
112. Both of these components function as described above with respect
to Figure 3, allowing for exempt e-mails to be passed directly to an e-mail
recipient 40, or determining whether digital IDs need to be forwarded to
second tier server 210. Replies received by message processing
algorithm 120 are acted on by rule determination algorithm 121 and e-mail
filtering 123.

At the second tier system 30', digital IDs transmitted from second 
tier processor 210 are transmitted to a digital ID processor 221. In this 
embodiment, processor 221 increments counter data stored in DS30 for 
each digital ID per unit time. As the volume of messages processed by 
database 220 can be quite large, the frequency algorithm may be adjusted 
to recognize changes in the volume of individual messages seen as a 
percentage of the total message volume of the system. 
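
The adjustment described above, in which an ID is judged by its share of total message volume rather than by an absolute count, might be sketched as follows. The class name, the one percent share threshold, and the minimum sample size are assumptions introduced for this example; counters would be reset at the end of each unit of time.

```python
from collections import Counter


class VolumeShareCounter:
    """Flag an ID when it makes up a large share of all IDs seen in a period."""

    def __init__(self, share_threshold: float = 0.01, min_total: int = 1000):
        self.share_threshold = share_threshold  # assumed fraction of volume
        self.min_total = min_total              # assumed minimum sample size
        self.counts = Counter()
        self.total = 0

    def record(self, digital_id: str) -> bool:
        """Increment the counter for an ID; return True if its share is high."""
        self.counts[digital_id] += 1
        self.total += 1
        if self.total < self.min_total:
            return False  # too little traffic to judge a percentage
        return self.counts[digital_id] / self.total >= self.share_threshold

    def reset(self) -> None:
        """Start a new unit of time."""
        self.counts.clear()
        self.total = 0
```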

The frequency data stored at DS30 feeds a reply generator 222 
which determines, based on both the data in the DS30 and particular 
information for a given client, (shown as data record DS40) whether the 
reply generated and forwarded to second tier server 210 should indicate 
that the message is spam or not. Configuration file DS40 may include 
rules, as set forth above, indicating that the reply from the second tier 
server 210 is forwarded to the rule determination component of message
processor 120 which decides, as set forth above, how to process the message
if it is in fact determined that it is spam. The filtered e-mail distribution
algorithm forwards the e-mail directly to the e-mail client 40 or to the 
rejected message repository as set forth above. 

A key feature of the present invention is that the digital IDs utilized
in the data identifier repository DS30 are drawn from a number of different
first tier systems. Thus, the greater the number of first tier systems which are
coupled to the second tier server and subsequent database 220, the more
powerful the system becomes.

It should be further recognized that other applications besides the
detection of spam e-mail include the detection of viruses, and the
identification of copyrighted material which is transmitted via the network.

Moreover, it should be recognized that the algorithm for processing 
digital identifiers and the data store DS30 are not static, but can be 
adjusted to look for other characteristics of the message or data which is 
being tested besides frequency. 

Hence, the system allows for leveraging between the number of 
first tier systems or clients coupled to the database to provide a filtering 
system which utilizes a limited amount of bandwidth while still providing a 
confidential and powerful e-mail filter. It should be further recognized that 
the maintainer of the second and third tier systems may generate revenue 
for the service provided by charging a fee for the service of providing the 
second tier system process. 

Still further, the system can collect and distribute anonymous
statistical data about the content classified. For example, where e-mail
filtering is the main application of the system, the system can identify the
percentage of total e-mail filtered which constitutes spam, where such
e-mail originates, and the like, and distribute it to interested parties for a fee
or other compensation.

CLAIMS



What is claimed is: 

1. A file content classification system comprising:
a digital ID generator;
an ID appearance database coupled to receive IDs from the ID
generator; and
a characteristic comparison routine identifying the file as having a
characteristic based on ID appearance in the appearance database.

2. The content classification system of claim 1 wherein said ID
generator comprises a hashing algorithm.

3. The content classification system of claim 2 wherein said hashing
algorithm is the MD5 hashing algorithm.

4. The content classification system of claim 1 wherein said ID
appearance database tracks the frequency of appearance of a digital ID.

5. The content classification system of claim 1 further including a
plurality of digital ID generators on different systems all coupled to and
providing IDs to said ID appearance database.

6. The content classification system of claim 5 wherein said plurality
of digital ID generators are coupled to said database via a combination of
public and private networks.

7. The content classification system of claim 6 wherein said database
is coupled to an intermediate server which is coupled to said plurality of
generators.

8. The content classification system of claim 6 wherein said
intermediate server is a web server.

9. The content classification system of claim 1 wherein said
characteristic comprises junk e-mail and said characteristic is defined by
a frequency of appearance of a digital ID.

10. A method for identifying a characteristic of a data file, comprising:
generating a digital identifier for the data file and forwarding the
identifier to a processing system;
determining whether the forwarded identifier matches a
characteristic of other identifiers; and
processing the email based on said step of determining.

11. The method of claim 10 wherein said step of generating comprises
hashing at least a portion of the data file.

12. The method of claim 11 wherein said step of hashing comprises
using the MD5 hash.

13. The method of claim 11 wherein said step of generating comprises
hashing multiple portions of the data file.

14. The method of claim 10 wherein said data file is an email message
and said step of determining comprises determining whether said email is
spam.

15. The method of claim 10 wherein said step of determining identifies
said e-mail as spam by tracking the rate per unit time a digital ID is
generated.

16. The method of claim 10 wherein said step of generating comprises
generating IDs at a plurality of source systems all coupled via a network
to at least one processing system performing the determining step.

17. The method of claim 16 wherein said step of processing comprises
instructing said plurality of source systems to perform an action with the
email based on said determining step.

18. A method of filtering an email message, comprising:
    processing the message to provide a digital identifier;
    comparing the digital identifier to a characteristic database of digital identifiers to determine whether the message has said characteristic; and
    processing the message based on said step of comparing.

19. The method of claim 18 wherein said step of processing occurs on
at least one first system, and said step of comparing occurs on a second
system.

20. The method of claim 19 wherein said step of processing occurs on
a plurality of first systems.

21. The method of claim 19 wherein said at least one first system and
second system are coupled by the Internet.

22. The method of claim 18 wherein said step of comparing comprises
determining the frequency of a particular ID occurring in a time period,
classifying said ID as having a characteristic, and comparing digital
identifiers to said classified IDs.

Attorney Docket No.: PACE01000US0LEV
lev/pace/1000.001



23. A file content classification system, comprising:
    a first system having a file to be classified;
    a file ID generator on the first system;
    a database on a second system coupled to the ID generator to receive IDs generated by the ID generator;
    a comparison routine on the second system classifying the ID relative to the database as meeting or not meeting a characteristic.



24. The system of claim 23 including a plurality of first systems each
including a respective file ID generator coupled to the database on the
second system.

25. The system of claim 24 wherein the plurality of first systems is
coupled to the second system via the Internet.

26. The system of claim 25 wherein the second system comprises a
web server interface system and a database system, wherein the database
system is isolated from the Internet by the web server system.

27. A content classification system for a first and second computer
coupled by a network, comprising:
    a client agent file identifier generator on the first computer; and
    a server comparison agent and data-structure on the second computer receiving identifiers from the client agent and providing replies to the client agent;
    wherein the client agent processes the file based on replies from the server comparison agent.

28. A method for providing a service on the Internet, comprising:
    collecting data from a plurality of systems having a client agent on the Internet to a server having a database;
    characterizing the data received relative to information collected in the database; and
    transmitting a content identifier to the client agent.

29. The method of claim 28 wherein said step of collecting comprises
collecting a digital identifier for a data file.

30. The method of claim 28 wherein said data file is an e-mail.

31. The method of claim 29 wherein said step of characterizing
comprises:
    tracking the frequency of the collection of a particular identifier;
    characterizing the data file based on said frequency;
    storing the characterization; and
    comparing collected identifiers to the known characterization.
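
Claims 11 through 13 recite generating the digital identifier by hashing one or more portions of the data file, optionally with MD5. The following is a minimal sketch of that step, not the applicants' implementation. The function name `generate_ids`, the `portion_size` parameter, and the choice of which portions to hash (leading bytes, trailing bytes, whole file) are assumptions made for illustration only.

```python
import hashlib
from typing import List


def generate_ids(data: bytes, portion_size: int = 64) -> List[str]:
    """Return MD5 hex digests for several portions of a data file.

    Hypothetical scheme: hash the first `portion_size` bytes, the last
    `portion_size` bytes, and the complete file. The claims do not say
    which portions are used; this selection is an assumption.
    """
    if not data:
        return []
    portions = [
        data[:portion_size],   # leading portion
        data[-portion_size:],  # trailing portion
        data,                  # entire file
    ]
    return [hashlib.md5(portion).hexdigest() for portion in portions]


if __name__ == "__main__":
    message = b"Subject: hello\r\n\r\nBuy now!" * 10
    for digital_id in generate_ids(message):
        print(digital_id)
```

Hashing more than one portion gives the receiving system several identifiers per file, so two messages that differ only in one region can still share an identifier.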
ABSTRACT

A file content classification system includes a digital ID generator
and an ID appearance database coupled to receive IDs from the ID
generator. The system further includes a characteristic comparison routine
identifying the file as having a characteristic based on ID appearance in
the appearance database. In a further aspect, a method for identifying a
characteristic of a data file is disclosed. The method comprises the steps
of: generating a digital identifier for the data file and forwarding the
identifier to a processing system; determining whether the forwarded
identifier matches a characteristic of other identifiers; and processing the
e-mail based on said step of determination.
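
The ID appearance database and characteristic comparison routine summarized above can be illustrated with a short sketch. It assumes, in line with the rate-per-unit-time language of claim 15, that an identifier is given the characteristic once it has been reported a set number of times within a sliding time window. The class name `AppearanceDatabase`, the `threshold` and `window_seconds` parameters, and their default values are hypothetical and do not come from the application.

```python
from collections import defaultdict, deque


class AppearanceDatabase:
    """Toy in-memory stand-in for the ID appearance database."""

    def __init__(self, threshold: int = 3, window_seconds: float = 60.0):
        self.threshold = threshold      # reports needed to flag an ID (assumed value)
        self.window = window_seconds    # sliding window length (assumed value)
        self._seen = defaultdict(deque)  # digital ID -> timestamps of recent reports

    def report(self, digital_id: str, now: float) -> bool:
        """Record one appearance of an ID and return True if it is now flagged."""
        times = self._seen[digital_id]
        times.append(now)
        # Discard appearances that have fallen out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) >= self.threshold


if __name__ == "__main__":
    db = AppearanceDatabase(threshold=3, window_seconds=60.0)
    for t in (0.0, 10.0, 20.0):
        print(t, db.report("example-id", now=t))
```

A client agent would forward each generated identifier to the system holding this structure and act on the returned flag, for example by filing the message as junk e-mail.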
Title: DISTRIBUTED CONTENT IDENTIFICATION SYSTEM



DECLARATION FOR PATENT APPLICATION 

As a below named inventor, I hereby declare that my residence, post office address and 
citizenship are as stated below next to my name; I believe that I am the original, first and sole 
inventor (if one name is listed below), first and joint inventor (if plural names are listed below) 
of the subject matter which is claimed and for which a patent is sought on the invention 
entitled: 

DISTRIBUTED CONTENT IDENTIFICATION SYSTEM 

the specification of which (check applicable ones): 



X is filed herewith; 

was filed with the above-identified "Filed" date and "SC/Serial 

No." 

was amended on (or amended through) . 



I hereby state that I have reviewed and understand the contents of the above-identified 
specification, including the claims, as amended by any amendment(s) referred to above. I 
acknowledge the duty to disclose information which is material to the examination of the 
application in accordance with Title 37, Code of Federal Regulations, §1.56. 

I hereby declare that all statements made herein of my own knowledge are true and 
that all statements made on information and belief are believed to be true, and further that 
these statements were made with the knowledge that willful false statements and the like so 
made are punishable by fine or imprisonment, or both, under §1001 of Title 18 of the United
States Code, and that such willful false statements may jeopardize the validity of the
application or any patent issuing thereon.



(1) Full name of sole or first inventor: Mark Raymond Pace

(1) Residence: 42 15th Avenue, San Mateo, California 94402

(1) Post Office Address: same

(1) Citizenship: United States of America

(1) Inventor's signature:

(1) Date:

(2) Full name of second joint inventor: Brooks Cash Talley

(2) Residence: 40 15th Avenue, San Mateo, California 94402

(2) Post Office Address: same

(2) Citizenship: United States of America

(2) Inventor's signature:
Title 37, Code of Federal Regulations, §1.56

SECTION 1.56. DUTY TO DISCLOSE INFORMATION
MATERIAL TO PATENTABILITY



(a) A patent by its very nature is affected with a public 
interest. The public interest is best served, and the most 
effective patent examination occurs when, at the time an 
application is being examined, the Office is aware of and 
evaluates the teachings of all information material to 
patentability. Each individual associated with the filing and 
prosecution of a patent application has a duty of candor and 
good faith in dealing with the Office, which includes a duty 
to disclose to the Office all information known to that 
individual to be material to patentability as defined in this 
section. The duty to disclose information exists with respect 
to each pending claim until the claim is cancelled or 
withdrawn from consideration, or the application becomes 
abandoned. Information material to the patentability of a 
claim that is cancelled or withdrawn from consideration 
need not be submitted if the information is not material to 
the patentability of any claim remaining under consideration 
in the application. There is no duty to submit information 
which is not material to the patentability of any existing 
claim. The duty to disclose all information known to be 
material to patentability is deemed to be satisfied if all 
information known to be material to patentability of any 
claim issued in a patent was cited by the Office or submitted 
to the Office in the manner prescribed by §§1.97(b)-(d) and
1.98.* However, no patent will be granted on an 
application in connection with which fraud on the Office 
was practiced or attempted or the duty of disclosure was 
violated through bad faith or intentional misconduct. The 
Office encourages applicants to carefully examine: 

(1) prior art cited in search reports of a foreign patent 
office in a counterpart application, and 

(2) the closest information over which individuals 
associated with the filing or prosecution of a patent 
application believe any pending claim patentably 
defines, to make sure that any material information 
contained therein is disclosed to the Office. 

(b) Under this section, information is material to 
patentability when it is not cumulative to information 
already of record or being made of record in the 
application, and 



(1) It establishes, by itself or in combination with 
other information, a prima facie case of unpatentability 
of a claim; or 

(2) It refutes, or is inconsistent with, a position the 
applicant takes in: 

(i) Opposing an argument of unpatentability 
relied on by the Office; or 

(ii) Asserting an argument of patentability. 

A prima facie case of unpatentability is established when the 
information compels a conclusion that a claim is 
unpatentable under the preponderance of evidence, burden- 
of-proof standard, giving each term in the claim its broadest 
reasonable construction consistent with the specification, 
and before any consideration is given to evidence which 
may be submitted in an attempt to establish a contrary 
conclusion of patentability. 

(c) Individuals associated with the filing or prosecution of 
a patent application within the meaning of this section are: 

(1) Each inventor named in the application; 

(2) Each attorney or agent who prepares or prosecutes 
the application; and 

(3) Every other person who is substantively involved 
in the preparation or prosecution of the application and 
who is associated with the inventor, with the assignee 
or with anyone to whom there is an obligation to assign 
the application. 

(d) Individuals other than the attorney, agent or inventor 
may comply with this section by disclosing information to 
the attorney, agent, or inventor. 



* §§1.97(b)-(d) and 1.98 relate to the timing and manner in
which information is to be submitted to the Office.